
initial work to reconcile devices from launcher pods #91

Merged

Conversation

@ibrokethecloud ibrokethecloud commented Sep 13, 2024

IMPORTANT: Please do not create a Pull Request without creating an issue first.

Problem:

Rancher allows provisioning downstream clusters that leverage vGPUs.

When used with a machine deployment of more than one node, the GPU actually allocated can differ from the name specified, because KubeVirt uses the DeviceName field to calculate the launcher pod's resource requirements, which the scheduler then uses to identify suitable nodes.

Since the device name itself is not used for allocation, any random string can be supplied, which makes it difficult to track vGPU allocation in the cluster.

Solution:

This PR introduces a minor change that leverages the pod environment variables set by the device plugin during ContainerAllocateResponse. An additional VMI controller execs into the launcher pod to identify the device ID set for each resource and maps it back to the corresponding GPU/host-device resource. Once the devices are identified, an annotation is set on the VM recording which devices, from the pool of devices passed through to the VM, were actually allocated.
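As background, the device-plugin side of this handoff can be sketched roughly as below: during Allocate, the plugin attaches an environment variable to the launcher container carrying the IDs of the devices actually handed out. The function name and environment-variable key are illustrative assumptions, not the actual pcidevices code.

```go
package deviceplugin

import (
	"strings"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// buildAllocateResponse sketches how a device plugin can surface the allocated
// device IDs to the launcher pod via an environment variable. The env key
// (e.g. "PCIDEVICE_INTEL_COM_8D26") and its value format are hypothetical.
func buildAllocateResponse(envKey string, deviceIDs []string) *pluginapi.ContainerAllocateResponse {
	return &pluginapi.ContainerAllocateResponse{
		Envs: map[string]string{
			// value would look like "vgpu-01-0000001d0" for a single device
			envKey: strings.Join(deviceIDs, ","),
		},
	}
}
```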

For example, for a host device the annotation looks as follows:

harvesterhci.io/deviceAllocationDetails: '{"hostdevices":{"intel.com/8d26":["vgpu-01-0000001d0"]}}'

For GPU devices it looks as follows:

harvesterhci.io/deviceAllocationDetails: '{"gpus":{"nvidia.com/NVIDIA_A2-2Q":["vgpu-01-000008005"]}}'

When a VM is shut down, the annotation is used to replace the placeholder names of vGPU and host devices with the actual device names from the deviceAllocationDetails annotation. This ensures the VM can still be edited and devices removed from the Harvester UI post-provisioning.
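A minimal sketch of that stop-time reconcile, assuming the annotation has already been parsed into per-resource maps of allocated device names (function and parameter names are illustrative, not the actual pcidevices implementation):

```go
package virtualmachine

import (
	kubevirtv1 "kubevirt.io/api/core/v1"
)

// restoreDeviceNames swaps the placeholder Name on each vGPU and host device
// in the VM spec for a real device name recorded in the annotation.
// gpuAlloc/hostDevAlloc map a resource name (DeviceName) to the list of
// allocated device names parsed from deviceAllocationDetails.
func restoreDeviceNames(vm *kubevirtv1.VirtualMachine, gpuAlloc, hostDevAlloc map[string][]string) {
	if vm.Spec.Template == nil {
		return
	}
	devices := &vm.Spec.Template.Spec.Domain.Devices

	for i := range devices.GPUs {
		gpu := &devices.GPUs[i]
		if names := gpuAlloc[gpu.DeviceName]; len(names) > 0 {
			gpu.Name = names[0]                  // take the next unclaimed device name
			gpuAlloc[gpu.DeviceName] = names[1:] // so the same name is not reused twice
		}
	}

	for i := range devices.HostDevices {
		hd := &devices.HostDevices[i]
		if names := hostDevAlloc[hd.DeviceName]; len(names) > 0 {
			hd.Name = names[0]
			hostDevAlloc[hd.DeviceName] = names[1:]
		}
	}
}
```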

Related Issue:

Test plan:

@ibrokethecloud (Collaborator, Author) commented:

backend changes related to: rancher/dashboard#11399

@Yu-Jack Yu-Jack (Contributor) left a comment:

LGTM, and just a reminder: we need to raise a chart PR to update the service account permissions so that the pcidevices-controller can list/update virtualmachines and list virtualmachineinstances.
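The required RBAC addition, sketched here as Go rule definitions purely for illustration (the actual chart change may structure the verbs and rules differently):

```go
package rbac

import rbacv1 "k8s.io/api/rbac/v1"

// additionalRules captures the permissions described above:
// list/update on virtualmachines and list on virtualmachineinstances.
var additionalRules = []rbacv1.PolicyRule{
	{
		APIGroups: []string{"kubevirt.io"},
		Resources: []string{"virtualmachines"},
		Verbs:     []string{"list", "update"},
	},
	{
		APIGroups: []string{"kubevirt.io"},
		Resources: []string{"virtualmachineinstances"},
		Verbs:     []string{"list"},
	},
}
```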

pkg/controller/virtualmachine/virtualmachine.go — 3 review threads (outdated, resolved)
@WebberHuang1118 WebberHuang1118 (Member) left a comment:
LGTM, just a nit, thanks.

@WebberHuang1118 WebberHuang1118 (Member) left a comment:

LGTM, thanks for the PR

initial work to reconcile devices from launcher pods

refactor vmi reconcile to pass codefactor check

refactor code and change validation logic to use new annotations

modified vmi/vm reconcile to also update vm with correct device names based on vgpu annotation when VM is stopped

include pr review feedback

fine tune logic for removing duplicates from allocation annotation
@ibrokethecloud ibrokethecloud merged commit b913aa8 into harvester:master Oct 2, 2024
5 checks passed
ibrokethecloud added a commit to harvester/charts that referenced this pull request Oct 6, 2024
ibrokethecloud added a commit to ibrokethecloud/harvester-charts that referenced this pull request Oct 7, 2024
ibrokethecloud added a commit to harvester/charts that referenced this pull request Oct 7, 2024